Introduction¶
Dataset Description¶
This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Each entry contains the following information about a film:
- imdb_id: The unique identifier of the movie on the Internet Movie Database (IMDb).
- popularity: A measure of how popular the movie is.
- budget: The amount of money spent to produce the movie.
- revenue: The amount of money earned by the movie.
- original_title: The title of the movie as it was originally released, in its original language.
- cast: A list of the main actors and actresses who starred in the movie, separated by pipe characters(|).
- homepage: The URL of the official website of the movie.
- director: The name of the person who directed the movie.
- tagline: A short and catchy phrase that summarizes the main theme
- keywords: A list of words or phrases related to the movie separated by pipe characters(|).
- overview: A brief summary of the movie’s story.
- runtime: The duration of the movie, in minutes.
- genres: A list of the categories that the movie belongs to, such as comedy, drama, horror, etc., separated by pipe characters(|).
- production_companies: A list of the companies that were involved in the production of the movie, separated by pipe characters(|).
- release_date: The date when the movie was first released to the public, in the format MM-DD-YY.
- vote_count: The number of votes that the movie has received from users.
- vote_average: The average rating that the movie has received from users on a scale from 0 to 10.
- release_year: The year when the movie was first released in the format YYYY.
- budget_adj: The amount of money spent to produce the movie, adjusted for inflation, in 2010.
- revenue_adj: The amount of money earned by the movie from ticket sales, adjusted for inflation, in 2010.
Question(s) for Analysis¶
- Research Question 1 - What are the top 5 cast members by frequency in the dataset?
- Research Question 2 - What are the most and least frequent genres of movies in the dataset?
- Research Question 3 - Which genres are most popular from year to year?
- Research Question 4 - How many movies were released in each year in the dataset?
- Research Question 5 - What is the median popularity of movies by genres?
- Research Question 6 - Is there a statistically significant difference between median of the movie's popularity across genres?
- Research Question 7 - Does the runtime of a movie have a different impact on its popularity depending on its genre?
!python --version
Python 3.11.7
# Install libraries
# !pip install scikit-posthocs
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
import plotly.express as px
import ipywidgets as widgets
from IPython.display import display
import scikit_posthocs as sp
import plotly.io as pio
pio.renderers.default='notebook'
%matplotlib inline
# Display all columns in the dataframe
pd.set_option('display.max_columns', None)
# Set the display format for floats
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Set the figure size
plt.rcParams["figure.figsize"] = (10, 6)
# Set style for plots
sns.set_style("whitegrid")
# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")
pd.__version__
'2.1.4'
Data Wrangling¶
# Load csv
df = pd.read_csv('tmdb-movies.csv')
df.head()
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.986 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | monster|dna|tyrannosaurus rex|velociraptor|island | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.500 | 2015 | 137999939.280 | 1392445892.524 |
| 1 | 76341 | tt1392190 | 28.420 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | future|chase|post-apocalyptic|dystopia|australia | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.100 | 2015 | 137999939.280 | 348161292.489 |
| 2 | 262500 | tt2908446 | 13.113 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | based on novel|revolution|dystopia|sequel|dyst... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.300 | 2015 | 101199955.472 | 271619025.408 |
| 3 | 140607 | tt2488496 | 11.173 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | android|spaceship|jedi|space opera|3d | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.500 | 2015 | 183999919.040 | 1902723129.802 |
| 4 | 168259 | tt2820852 | 9.335 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | car race|speed|revenge|suspense|car | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.300 | 2015 | 174799923.088 | 1385748801.471 |
df.shape
(10866, 21)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10866 entries, 0 to 10865 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10866 non-null int64 1 imdb_id 10856 non-null object 2 popularity 10866 non-null float64 3 budget 10866 non-null int64 4 revenue 10866 non-null int64 5 original_title 10866 non-null object 6 cast 10790 non-null object 7 homepage 2936 non-null object 8 director 10822 non-null object 9 tagline 8042 non-null object 10 keywords 9373 non-null object 11 overview 10862 non-null object 12 runtime 10866 non-null int64 13 genres 10843 non-null object 14 production_companies 9836 non-null object 15 release_date 10866 non-null object 16 vote_count 10866 non-null int64 17 vote_average 10866 non-null float64 18 release_year 10866 non-null int64 19 budget_adj 10866 non-null float64 20 revenue_adj 10866 non-null float64 dtypes: float64(4), int64(6), object(11) memory usage: 1.7+ MB
# First look - Descriptive statistics
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 10866.000 | 66064.177 | 92130.137 | 5.000 | 10596.250 | 20669.000 | 75610.000 | 417859.000 |
| popularity | 10866.000 | 0.646 | 1.000 | 0.000 | 0.208 | 0.384 | 0.714 | 32.986 |
| budget | 10866.000 | 14625701.094 | 30913213.831 | 0.000 | 0.000 | 0.000 | 15000000.000 | 425000000.000 |
| revenue | 10866.000 | 39823319.793 | 117003486.582 | 0.000 | 0.000 | 0.000 | 24000000.000 | 2781505847.000 |
| runtime | 10866.000 | 102.071 | 31.381 | 0.000 | 90.000 | 99.000 | 111.000 | 900.000 |
| vote_count | 10866.000 | 217.390 | 575.619 | 10.000 | 17.000 | 38.000 | 145.750 | 9767.000 |
| vote_average | 10866.000 | 5.975 | 0.935 | 1.500 | 5.400 | 6.000 | 6.600 | 9.200 |
| release_year | 10866.000 | 2001.323 | 12.813 | 1960.000 | 1995.000 | 2006.000 | 2011.000 | 2015.000 |
| budget_adj | 10866.000 | 17551039.823 | 34306155.723 | 0.000 | 0.000 | 0.000 | 20853251.084 | 425000000.000 |
| revenue_adj | 10866.000 | 51364363.253 | 144632485.040 | 0.000 | 0.000 | 0.000 | 33697095.717 | 2827123750.412 |
Data Cleaning¶
User-defined functions¶
def replace0(df:pd.DataFrame,feature:str):
"""Replace 0s with NaNs in a dataframe column and returns the zero counts."""
before = df.query(f'{feature} == 0').shape[0]
df[feature].replace({0: np.nan}, inplace=True)
after = df.query(f'{feature} == 0').shape[0]
return f"the number of zeros in {feature} column : {before} -> {after}"
1. Drop unused columns¶
Drop unused columns that is not related to research questions.
- 'budget' and 'revenue' features exist in adjusted form in the dataset.
- 'imdb_id', 'homepage', 'tagline', 'overview', 'keywords' features are not related to research questions.
# Display a sample
df.sample(1)
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 938 | 254024 | tt2967008 | 0.409 | 0 | 0 | The Face of an Angel | Kate Beckinsale|Cara Delevingne|Daniel Brühl|... | NaN | Michael Winterbottom | Forget the truth, find the story. | murder investigation | Both a journalist and a documentary filmmaker ... | 100 | Thriller|Drama | Revolution Films|BBC Films|Cattleya|Vedette Fi... | 9/6/14 | 28 | 4.200 | 2014 | 0.000 | 0.000 |
# Drop columns
drop_column_list = ['imdb_id','budget','revenue','homepage','tagline','overview','keywords']
display(df.shape)
df.drop(columns=drop_column_list,inplace=True)
display(df.shape)
(10866, 21)
(10866, 14)
# Display a sample
df.sample(1)
| id | popularity | original_title | cast | director | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8048 | 12633 | 0.558 | The Star Chamber | Michael Douglas|Hal Holbrook|Yaphet Kotto|Shar... | Peter Hyams | 109 | Action|Crime|Drama|Thriller | Twentieth Century Fox Film Corporation|Frank Y... | 8/5/83 | 15 | 5.100 | 1983 | 0.000 | 0.000 |
2. Remove duplicated columns¶
# Check if the Dataset have any Duplicate
df.duplicated().value_counts()
False 10865 True 1 Name: count, dtype: int64
# Drop Duplicates
df.drop_duplicates(inplace=True)
# Display shape
df.shape
(10865, 14)
3. Check missing values¶
# Display descriptive statistics
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 10865.000 | 66066.374 | 92134.092 | 5.000 | 10596.000 | 20662.000 | 75612.000 | 417859.000 |
| popularity | 10865.000 | 0.646 | 1.000 | 0.000 | 0.208 | 0.384 | 0.714 | 32.986 |
| runtime | 10865.000 | 102.072 | 31.383 | 0.000 | 90.000 | 99.000 | 111.000 | 900.000 |
| vote_count | 10865.000 | 217.400 | 575.645 | 10.000 | 17.000 | 38.000 | 146.000 | 9767.000 |
| vote_average | 10865.000 | 5.975 | 0.935 | 1.500 | 5.400 | 6.000 | 6.600 | 9.200 |
| release_year | 10865.000 | 2001.322 | 12.813 | 1960.000 | 1995.000 | 2006.000 | 2011.000 | 2015.000 |
| budget_adj | 10865.000 | 17549894.037 | 34307526.658 | 0.000 | 0.000 | 0.000 | 20853251.084 | 425000000.000 |
| revenue_adj | 10865.000 | 51369001.759 | 144638333.129 | 0.000 | 0.000 | 0.000 | 33701729.009 | 2827123750.412 |
In descriptive statistics of the dataset, it is seen that there are 0 values in runtime, budget_adj and revenue_adj. 0 values in these features will be considered as null values.
Popularity feature may have been calculated as 0.
# replace 0 with np.nan in the runtime column
replace0(df,'runtime')
'the number of zeros in runtime column : 31 -> 0'
# replace 0 with np.nan in the budget_adj
replace0(df,'budget_adj')
'the number of zeros in budget_adj column : 5696 -> 0'
# replace 0 with np.nan in the revenue_adj
replace0(df,'revenue_adj')
'the number of zeros in revenue_adj column : 6016 -> 0'
# Check the sum of Missing Values per column
df.isnull().sum()
id 0 popularity 0 original_title 0 cast 76 director 44 runtime 31 genres 23 production_companies 1030 release_date 0 vote_count 0 vote_average 0 release_year 0 budget_adj 5696 revenue_adj 6016 dtype: int64
# Check the Percentage of Missing Values
df.isnull().sum() / df.shape[0] * 100
id 0.000 popularity 0.000 original_title 0.000 cast 0.699 director 0.405 runtime 0.285 genres 0.212 production_companies 9.480 release_date 0.000 vote_count 0.000 vote_average 0.000 release_year 0.000 budget_adj 52.425 revenue_adj 55.370 dtype: float64
# Plot the Percentage of Missing Values
ax = (df.isnull().sum() / df.shape[0] * 100).plot(kind='barh')
plt.bar_label(ax.containers[0], fmt='%.2f%%');
4. Convert to datetime format¶
We need to convert release_date format to pd dateformat in order to make analysis using pandas.
# Display a sample
df.release_date.sample(1)
31 3/4/15 Name: release_date, dtype: object
# Convert str to date format
df['release_date'] = pd.to_datetime(df['release_date'])
# Control a sample
df.release_date.loc[10561]
Timestamp('1986-09-23 00:00:00')
# List of columns which data types are object
df.select_dtypes(include=["object"]).columns
Index(['original_title', 'cast', 'director', 'genres', 'production_companies'], dtype='object')
# Descriptive statistics of object columns
df.select_dtypes(include=["object"]).describe().T
| count | unique | top | freq | |
|---|---|---|---|---|
| original_title | 10865 | 10571 | Hamlet | 4 |
| cast | 10789 | 10719 | Louis C.K. | 6 |
| director | 10821 | 5067 | Woody Allen | 45 |
| genres | 10842 | 2039 | Comedy | 712 |
| production_companies | 9835 | 7445 | Paramount Pictures | 156 |
# make a copy
df0 = df.copy()
Exploratory Data Analysis¶
User-defined functions¶
def explode(df_:pd.DataFrame,feature:str):
"""Split and expand a column of delimited values into multiple rows."""
df_temp = df_.copy()
df_temp[feature] = df_temp[feature].str.split('|')
return df_temp.explode(feature)
Research Question 1 - What are the top 5 cast members by frequency in the dataset?¶
# Get one sample
df.sample(1)
| id | popularity | original_title | cast | director | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6500 | 292 | 0.151 | Dave Chappelle's Block Party | Dave Chappelle|Erykah Badu|Common|Mos Def|Bilal | Michel Gondry | 100.000 | Comedy|Documentary|Music | Yari Film Group | 2005-09-12 | 15 | 7.000 | 2005 | 3349689.869 | 13084552.983 |
# Explode dataframe to have each genre as a seperate row
df_cast_explode = explode(df,'cast')
df_cast_explode.head(3)
| id | popularity | original_title | cast | director | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | 32.986 | Jurassic World | Chris Pratt | Colin Trevorrow | 124.000 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.500 | 2015 | 137999939.280 | 1392445892.524 |
| 0 | 135397 | 32.986 | Jurassic World | Bryce Dallas Howard | Colin Trevorrow | 124.000 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.500 | 2015 | 137999939.280 | 1392445892.524 |
| 0 | 135397 | 32.986 | Jurassic World | Irrfan Khan | Colin Trevorrow | 124.000 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.500 | 2015 | 137999939.280 | 1392445892.524 |
# the number of null values
df_cast_explode.cast.isna().sum()
76
# Drop null values
display(df_cast_explode.shape)
df_cast_explode.dropna(subset=['cast'], inplace = True)
display(df_cast_explode.shape)
(52644, 14)
(52568, 14)
# the number of duplicated records
df_cast_explode.duplicated().sum()
19
# Drop duplicates
df_cast_explode.drop_duplicates(inplace=True)
display(df_cast_explode.shape)
(52549, 14)
# the unique number of cast members in the dataset
df_cast_explode.cast.nunique()
19026
There are 19026 different cast in the dataset
# Top 5 cast members
cast_counts = df_cast_explode['cast'].value_counts()[:5]
cast_counts
cast Robert De Niro 72 Samuel L. Jackson 71 Bruce Willis 62 Nicolas Cage 61 Michael Caine 53 Name: count, dtype: int64
# Visualization of top 5 members
ax = sns.barplot(x=cast_counts.index, y=cast_counts.values, palette = 'Blues_r')
plt.xlabel('Cast')
plt.ylabel('Count')
plt.title('Top 5 cast members by frequency')
plt.xticks(rotation=45)
for container in ax.containers:
plt.bar_label(container);
Conclusion: The top 5 cast members in the dataset by frequency are Robert De Niro(72), Samuel L. Jackson(71), Bruce Willis(62), Nicolas Cage(61) and Michael Caine(53)
Research Question 2 - What are the most and least frequent genres of movies in the dataset?¶
# Explode dataframe to have each genre as a seperate row
df_genres_explode = explode(df,'genres')
df_genres_explode.head(3)
| id | popularity | original_title | cast | director | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | 32.986 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | 124.000 | Action | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.500 | 2015 | 137999939.280 | 1392445892.524 |
| 0 | 135397 | 32.986 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | 124.000 | Adventure | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.500 | 2015 | 137999939.280 | 1392445892.524 |
| 0 | 135397 | 32.986 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | 124.000 | Science Fiction | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.500 | 2015 | 137999939.280 | 1392445892.524 |
# Look at its shape
df_genres_explode.shape
(26978, 14)
# Genres in the dataset
df_genres_explode.genres.unique()
array(['Action', 'Adventure', 'Science Fiction', 'Thriller', 'Fantasy',
'Crime', 'Western', 'Drama', 'Family', 'Animation', 'Comedy',
'Mystery', 'Romance', 'War', 'History', 'Music', 'Horror',
'Documentary', 'TV Movie', nan, 'Foreign'], dtype=object)
# the number of null values in genres
df_genres_explode.genres.isna().sum()
23
# Drop nulls if exist in genres column
display(df_genres_explode.shape)
df_genres_explode.dropna(subset=['genres'], inplace = True)
display(df_genres_explode.shape)
(26978, 14)
(26955, 14)
# Genres count in the dataset
df_genres_explode.groupby('genres').count()[['id']]
| id | |
|---|---|
| genres | |
| Action | 2384 |
| Adventure | 1471 |
| Animation | 699 |
| Comedy | 3793 |
| Crime | 1354 |
| Documentary | 520 |
| Drama | 4760 |
| Family | 1231 |
| Fantasy | 916 |
| Foreign | 188 |
| History | 334 |
| Horror | 1637 |
| Music | 408 |
| Mystery | 810 |
| Romance | 1712 |
| Science Fiction | 1229 |
| TV Movie | 167 |
| Thriller | 2907 |
| War | 270 |
| Western | 165 |
# Genres count in the dataset sorted descending
df_genres_explode['genres'].value_counts()
genres Drama 4760 Comedy 3793 Thriller 2907 Action 2384 Romance 1712 Horror 1637 Adventure 1471 Crime 1354 Family 1231 Science Fiction 1229 Fantasy 916 Mystery 810 Animation 699 Documentary 520 Music 408 History 334 War 270 Foreign 188 TV Movie 167 Western 165 Name: count, dtype: int64
# the visualization of genres count in the dataset sorted descending
ax = sns.countplot(x='genres',data=df_genres_explode, order=df_genres_explode['genres'].value_counts().index, palette = 'Blues_r')
plt.xticks(rotation=45)
for container in ax.containers:
plt.bar_label(container);
Conclusion: the most frequent genre is Drama(4760), the lease frequent genre is Western(165).
Research Question 3 - Which genres are most popular from year to year?¶
# Genres popularity mean by year
df_genres_explode.groupby(['release_year','genres'])[['popularity']].mean().reset_index().head(20)
| release_year | genres | popularity | |
|---|---|---|---|
| 0 | 1960 | Action | 0.591 |
| 1 | 1960 | Adventure | 0.701 |
| 2 | 1960 | Comedy | 0.396 |
| 3 | 1960 | Crime | 0.346 |
| 4 | 1960 | Drama | 0.566 |
| 5 | 1960 | Family | 0.278 |
| 6 | 1960 | Fantasy | 0.428 |
| 7 | 1960 | Foreign | 0.195 |
| 8 | 1960 | History | 0.412 |
| 9 | 1960 | Horror | 0.591 |
| 10 | 1960 | Music | 0.424 |
| 11 | 1960 | Romance | 0.541 |
| 12 | 1960 | Science Fiction | 0.328 |
| 13 | 1960 | Thriller | 0.812 |
| 14 | 1960 | War | 0.226 |
| 15 | 1960 | Western | 0.568 |
| 16 | 1961 | Action | 0.366 |
| 17 | 1961 | Adventure | 0.767 |
| 18 | 1961 | Animation | 2.632 |
| 19 | 1961 | Comedy | 0.680 |
# Line plot of genres popularity mean by year
df2 = df_genres_explode.groupby(['release_year','genres'])[['popularity']].mean().reset_index()
ax = sns.lineplot(x='release_year',y='popularity',hue='genres',data=df2)
sns.move_legend(ax,'upper left', bbox_to_anchor=(1,1))
Due to the high number of unique genres, a separate graph will be drawn for each genre, after which an interactive version will be built to better understand the data.
g = sns.FacetGrid(df_genres_explode, col="genres", col_wrap=5, height=3, aspect=1.3)
g.map(sns.lineplot, "release_year", "popularity")
g.set_titles("{col_name}")
g.tight_layout()
<seaborn.axisgrid.FacetGrid at 0x23f2ef8e7d0>
Genres popularity from year to year.
# Create an interactive line plot with plotly express
fig = px.line(df2, x='release_year', y='popularity', color='genres', width= 1000, height = 600)
# Show the interactive plot
fig.show()
Interactive plot of genres popularity from year to year.
# Select the most popular genre of the years
idx = df2.groupby('release_year')['popularity'].idxmax()
# Select those rows from the original dataframe
most_popular = df2.loc[idx]
# What are the most popular genres each year?
most_popular
| release_year | genres | popularity | |
|---|---|---|---|
| 13 | 1960 | Thriller | 0.812 |
| 18 | 1961 | Animation | 2.632 |
| 34 | 1962 | Adventure | 0.943 |
| 52 | 1963 | Animation | 2.180 |
| 83 | 1964 | War | 0.931 |
| 95 | 1965 | Music | 0.969 |
| 105 | 1966 | Animation | 0.586 |
| 123 | 1967 | Animation | 1.349 |
| 149 | 1968 | Mystery | 1.519 |
| 159 | 1969 | Crime | 0.948 |
| 175 | 1970 | Animation | 1.128 |
| 199 | 1971 | Family | 1.531 |
| 215 | 1972 | Crime | 1.073 |
| 229 | 1973 | Animation | 0.957 |
| 257 | 1974 | Mystery | 0.702 |
| 265 | 1975 | Adventure | 0.880 |
| 285 | 1976 | Crime | 0.707 |
| 301 | 1977 | Action | 1.419 |
| 331 | 1978 | Music | 0.680 |
| 339 | 1979 | Action | 1.410 |
| 368 | 1980 | Science Fiction | 0.897 |
| 374 | 1981 | Adventure | 0.876 |
| 409 | 1982 | War | 1.143 |
| 412 | 1983 | Adventure | 0.901 |
| 437 | 1984 | Family | 0.824 |
| 455 | 1985 | Family | 0.924 |
| 467 | 1986 | Adventure | 0.799 |
| 491 | 1987 | History | 0.816 |
| 501 | 1988 | Action | 0.599 |
| 522 | 1989 | Animation | 1.178 |
| 540 | 1990 | Adventure | 0.802 |
| 561 | 1991 | Animation | 1.665 |
| 580 | 1992 | Animation | 1.287 |
| 605 | 1993 | Fantasy | 0.919 |
| 620 | 1994 | Crime | 1.298 |
| 638 | 1995 | Animation | 1.468 |
| 659 | 1996 | Crime | 0.977 |
| 689 | 1997 | Science Fiction | 1.140 |
| 711 | 1998 | War | 1.247 |
| 714 | 1999 | Adventure | 1.012 |
| 734 | 2000 | Adventure | 0.855 |
| 761 | 2001 | Fantasy | 1.565 |
| 781 | 2002 | Fantasy | 1.430 |
| 801 | 2003 | Fantasy | 1.748 |
| 821 | 2004 | Fantasy | 1.321 |
| 841 | 2005 | Fantasy | 1.118 |
| 861 | 2006 | Fantasy | 1.023 |
| 881 | 2007 | Fantasy | 0.957 |
| 894 | 2008 | Adventure | 1.008 |
| 914 | 2009 | Adventure | 1.138 |
| 933 | 2010 | Adventure | 1.360 |
| 971 | 2011 | Western | 1.176 |
| 991 | 2012 | Western | 1.733 |
| 993 | 2013 | Adventure | 1.261 |
| 1012 | 2014 | Adventure | 2.431 |
| 1031 | 2015 | Adventure | 3.284 |
Conclusion: The most popular genre in the last three years is adventure. the most popular genre was fantasy from 2001 to 2007.
Research Question 4 - How many movies were released in each year in the dataset?¶
# Get the number of movies by grouping according to release year.
df_year = df.groupby('release_year').size()
df_year.head(3)
release_year 1960 32 1961 31 1962 32 dtype: int64
# Line plot of number of movies across years.
sns.lineplot(x=df_year.index,y=df_year.values, label= 'Number of movies', color='green')
plt.xlabel("Release year")
plt.ylabel("Number of movies")
plt.title("Number of movies by release year")
plt.show()
# https://plotly.com/python/line-charts/
fig = px.line(df_year, x=df_year.index, y=df_year.values, labels={'y':'Number of movies'}, title='Number of movies by release year', width= 1000, height=600)
fig.update_layout(hovermode="x")
fig.show()
Conclusion: The number of movies increased gradually over the years.
Research Question 5 - What is the median popularity of movies by genres?¶
# The histogram of popularity
sns.histplot(x='popularity',kde=True, data=df, bins=20);
# Descriptive statistics of popularity
df.describe()[['popularity']].T
| count | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|
| popularity | 10865.000 | 0.646 | 0.000 | 0.208 | 0.384 | 0.714 | 32.986 | 1.000 |
# The boxplot of popularity
sns.boxplot(x='popularity', data=df)
<Axes: xlabel='popularity'>
# Histograms of popularity by genre
plt.figure(figsize=(20,15))
for count,genre in enumerate(df_genres_explode.genres.unique()):
plt.subplot(5,5,count + 1)
sns.histplot(x='popularity',kde=True, data=df_genres_explode.query('genres == @genre'), bins=20)
plt.title(genre)
plt.tight_layout()
# Genre popularity skewness
for count,genre in enumerate(df_genres_explode.genres.unique()):
print(f"{genre}: {df_genres_explode.query('genres == @genre').popularity.skew()}")
Action: 8.519559576213021 Adventure: 7.544246925782778 Science Fiction: 8.346574445070894 Thriller: 12.31808653463584 Fantasy: 3.3442974777303105 Crime: 4.141872354110435 Western: 6.173147812882716 Drama: 9.26647507842431 Family: 3.399000948901938 Animation: 3.2776512678707967 Comedy: 3.85839662551183 Mystery: 4.033738855312595 Romance: 3.0004699872067215 War: 4.07009677907743 History: 5.559047399136068 Music: 3.404349210716752 Horror: 5.209936328110234 Documentary: 1.7119359791142117 TV Movie: 2.8382606971691686 Foreign: 1.1906528356317898
All genre popularity skewness are right-skewed in the dataset. In this case looking at the medians of the popularity is more meaningful than looking at the means, as the medians are less affected by outliers.
# Looking at the mean and the median side by side
df_genres_explode.groupby('genres')[['popularity']].agg(['mean','median'])
| popularity | ||
|---|---|---|
| mean | median | |
| genres | ||
| Action | 0.926 | 0.470 |
| Adventure | 1.154 | 0.561 |
| Animation | 0.852 | 0.501 |
| Comedy | 0.593 | 0.389 |
| Crime | 0.745 | 0.476 |
| Documentary | 0.181 | 0.142 |
| Drama | 0.591 | 0.389 |
| Family | 0.787 | 0.470 |
| Fantasy | 0.993 | 0.530 |
| Foreign | 0.191 | 0.165 |
| History | 0.576 | 0.413 |
| Horror | 0.465 | 0.322 |
| Music | 0.487 | 0.322 |
| Mystery | 0.690 | 0.428 |
| Romance | 0.592 | 0.404 |
| Science Fiction | 1.002 | 0.440 |
| TV Movie | 0.271 | 0.231 |
| Thriller | 0.742 | 0.444 |
| War | 0.728 | 0.432 |
| Western | 0.591 | 0.328 |
# Median popularity of genres
df_median_popularity = df_genres_explode.groupby('genres')['popularity'].agg(['median'])
df_median_popularity
| median | |
|---|---|
| genres | |
| Action | 0.470 |
| Adventure | 0.561 |
| Animation | 0.501 |
| Comedy | 0.389 |
| Crime | 0.476 |
| Documentary | 0.142 |
| Drama | 0.389 |
| Family | 0.470 |
| Fantasy | 0.530 |
| Foreign | 0.165 |
| History | 0.413 |
| Horror | 0.322 |
| Music | 0.322 |
| Mystery | 0.428 |
| Romance | 0.404 |
| Science Fiction | 0.440 |
| TV Movie | 0.231 |
| Thriller | 0.444 |
| War | 0.432 |
| Western | 0.328 |
Conclusion: Popularity skewness are right skewed across genres.
Research Question 6 - Is there a statistically significant difference between median of the movie's popularity across genres?¶
# Make genre groups to be calculated in the statistical test
genre_groups = [group['popularity'].values for name,group in df_genres_explode.groupby('genres')]
# Apply Kruskal Wallis test to genre groups
statistic, p_value = stats.kruskal(*genre_groups)
# display Kruskal Wallis test results
print(f'Kruskal-Wallis Statistic: {statistic}')
print(f'p-value: {p_value}')
Kruskal-Wallis Statistic: 1586.9103452441982 p-value: 0.0
# Perform hyptoheses test
if p_value < 0.05:
print('There is a statistically significant difference in median popularity across genres')
else:
print('There is no statistically significant difference in median popularity across genres')
There is a statistically significant difference in median popularity across genres
# https://www.statology.org/dunns-test-python/
# To find out which groups have different medians after a Kruskal-Wallis test, we can use Dunn’s Test for pairwise comparisons.
posthoc_result = sp.posthoc_dunn(df_genres_explode,val_col='popularity',group_col='genres',p_adjust='bonferroni')
# dataframe that shows True if the difference is statistically significant, shows False if there is not statistically significant difference between medians of the pairs.
posthoc_result_p = posthoc_result < 0.05
# Heatmap of genre pairs showing whether there is a statistically significant difference between pair medians.
plt.figure(figsize=(20, 20))
sns.heatmap(posthoc_result_p, annot=True, cmap='Blues', fmt="", cbar=False)
plt.xlabel("Genres")
plt.ylabel("Genres")
plt.title("Post-hoc Dunn's test results");
# True means there is statistically significant difference between median popularity of the two genres.
# False means there is no statistically significant difference between median popularity of the two genres.
# Creating a new df to be used in plotly interactive hover template
# Create an empty numpy array with the same shape as df_median_popularity
new_array = np.empty((len(df_median_popularity), len(df_median_popularity)))
new_array = new_array.astype(str)
# Fill the new array with the value of the pairs
for i, x in enumerate(df_median_popularity.index):
for j, y in enumerate(df_median_popularity.index):
new_array[i][j] = f'({x}:{df_median_popularity.loc[x][0]} {y}:{df_median_popularity.loc[y][0]})'
# Convert the new array to a pandas dataframe
new_df = pd.DataFrame(new_array, index=df_median_popularity.index, columns=df_median_popularity.index)
# Display a sample
new_df.sample(1).iloc[0,0]
'(Documentary:0.1422399999999995 '
# Interactive heatmap showing median pairwise comparisons.
fig = px.imshow(posthoc_result < 0.05, color_continuous_scale='Blues')
fig.update_xaxes(side="top")
fig.update_layout(title="Post-hoc Dunn's test results", title_x=0.5, width=1000, height=1000)
fig.update_coloraxes(showscale=False)
fig.update_traces(text=new_df, texttemplate="%{text}")
fig.update_traces(hovertemplate='%{x} - %{y} <br> %{text}')
fig.show()
# Blue color means there is statistically significant difference between median popularity of the two genres.
# White color means there is no statistically significant difference between median popularity of the two genres.
Conclusion: There is a statistically significant difference in median popularity across genres.
Research Question 7 - Does the runtime of a movie have a different impact on its popularity depending on its genre?¶
# First look at the runtime-popularity scatterplot across all movies
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="runtime", y="popularity")
plt.title(f"All Movies")
plt.xlabel("Runtime")
plt.ylabel("Popularity")
plt.show();
# Display the runtime-popularity scatterplot - genres breakdown
plt.figure(figsize=(8, 6))
g = sns.FacetGrid(df_genres_explode, col="genres", col_wrap=5, height=3, aspect=1.3)
g.map(sns.scatterplot, "runtime", "popularity")
g.set_titles("{col_name}")
g.tight_layout();
<Figure size 800x600 with 0 Axes>
# Make it interactive. Create a dropdown widget for genre selection
genre_dropdown = widgets.Dropdown(
options=df_genres_explode["genres"].unique(),
description="Select Genre:",
disabled=False,
)
def plot_scatterplot(genre):
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_genres_explode[df_genres_explode["genres"] == genre],
x="runtime", y="popularity")
plt.title(f"{genre} Movies")
plt.xlabel("Runtime")
plt.ylabel("Popularity")
plt.xlim(0, df_genres_explode["runtime"].max())
plt.ylim(0, df_genres_explode["popularity"].max())
plt.show()
# Display the dropdown and plot the scatterplot
widgets.interactive(plot_scatterplot, genre=genre_dropdown)
Conclusion: According to the plots, it seems that the length of a film affects its popularity. It should be between 80 and 160. People do not seem to like watching longer films, especially over 200 minutes. There seem to be slight differences between genres.
Conclusions¶
The top 5 cast members in the dataset by frequency are Robert De Niro(72), Samuel L. Jackson(71), Bruce Willis(62), Nicolas Cage(61) and Michael Caine(53)
the most frequent genre is Drama(4760), the lease frequent genre is Western(165) in the dataset.
The most popular genre in the last three years is adventure. the most popular genre was fantasy from 2001 to 2007.
The number of movies increased gradually over the years.
Popularity skewness are right skewed across genres.
There is a statistically significant difference in median popularity across genres.
According to the plots, it seems that the length of a film affects its popularity. It should be between 80 and 160. People do not seem to like watching longer films, especially over 200 minutes. There seem to be slight differences between genres.
Limitations¶
The data may not be accurate or updated.
The budget and revenue columns do not specify the currency unit, which may vary depending on the country of production.
The Kruskal-Wallis test requires similar sample sizes, but some variation is acceptable. (https://www.statisticssolutions.com/kruskal-wallis-test/)
# !jupyter nbconvert --to html Investigate_a_Dataset.ipynb
# Embed widgets
# !jupyter nbconvert --to html --execute Investigate_a_Dataset.ipynb
# Save requirements.txt file
# !pigar generate